Changed regex for calculation of percent hemoglobin genes #229

KriBaLin · 2023-08-15T09:28:55Z

Dear Theis lab,

thank you a lot for your very helpful book and tutorials.

I am currently performing my first analysis of scRNA-seq data. During step 6.3 (filtering low quality reads) I wanted to understand the regex for filtering hemoglobin genes (`^HB[^(P)]`).
I noticed that this regex matches not only the hemoglobin genes, but also the genes HBEGF and HBS1L; of the non-hemoglobin genes starting with "HB", only HBP1 is excluded.

I was trying to find a more specific regex that matches only the hemoglobin genes, with some help from Stack Overflow. I'd suggest `^HB(?!EGF|S1L|P1).+`, which is what I changed in the Jupyter notebook. An alternative might be `^HB[^(P|S)]($|[^G])`.

This applies to human data; however, we briefly confirmed that these regexes are applicable (with lowercase characters) to mouse data, too.

Please correct me if I am wrong and the original regex performs in the way intended by you. In this case, I would suggest extending the documentation for clarification.

Best,

Kristina

edit: added code backticks to the suggested regexes for correct display

Zethson · 2023-08-15T09:57:25Z

Dear @KriBaLin ,

thank you!

`^HB(?!EGF|S1L|P1).+` seems a bit specific and I'm worried that there might be other genes that we're not excluding as false positives here. Is this fear unjustified?

So `^HB[^(P|S)]` (which I think is equivalent to `^HB[^PS]`?) might be a more appealing option if this is the case. Note that this would still match HBEGF...
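The equivalence question above can be checked with a small sketch. Inside a character class, `(`, `|` and `)` are literal characters rather than grouping or alternation, so the two patterns can only differ on strings whose third character is one of those symbols. The two artificial strings `HB(X` and `HB|X` are made up for illustration and are not gene symbols.

```python
import re

# Inside [...] the characters "(", "|" and ")" are literals, not operators.
with_parens = re.compile(r"^HB[^(P|S)]")
plain = re.compile(r"^HB[^PS]")

# Gene symbols quoted in this thread, plus two artificial strings
# (not real genes) whose third character is "(" or "|".
symbols = ["HBB", "HBA1", "HBEGF", "HBS1L", "HBP1", "HB(X", "HB|X"]

for s in symbols:
    print(s, bool(with_parens.search(s)), bool(plain.search(s)))

# Collect the inputs on which the two patterns disagree.
differing = [
    s for s in symbols
    if bool(with_parens.search(s)) != bool(plain.search(s))
]
print(differing)
```

Only the artificial strings end up in `differing`; for ordinary gene symbols the two patterns behave the same, and both still match HBEGF.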

What do you think?

KriBaLin · 2023-08-15T12:10:28Z

Dear @Zethson,

thank you for your fast reply.

Sorry, there was a formatting mistake in my first post that turned the suggested `^HB[^(P|S)]($|[^G])` into an incorrect `^HB[^(P|S)]`. I have edited the post now.

Regarding the expression being too specific, I'm honestly not experienced enough to judge this with respect to future changes of gene annotations or the like. Currently, when I search the 36,601 genes of my human data set for genes starting with "HB", I get 13 hits: HBEGF, HBS1L, HBP1, HBB, HBD, HBG1, HBG2, HBE1, HBZ, HBM, HBA2, HBA1, HBQ1.

The first three don't seem to be hemoglobin genes. The regex `^HB[^(P)]` only excludes HBP1, whilst `^HB(?!EGF|S1L|P1).+` and `^HB[^(P|S)]($|[^G])` exclude all three. Of these two, the lookahead version might be a bit easier to understand.
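The comparison above can be reproduced with a short sketch that runs the three patterns over the 13 symbols listed in this comment. The helper names `false_positives` and `missed` are made up for illustration, and the split into hemoglobin and non-hemoglobin genes follows the assessment given in this thread rather than an external annotation.

```python
import re

# The 13 human gene symbols starting with "HB" reported in this thread.
genes = [
    "HBEGF", "HBS1L", "HBP1", "HBB", "HBD", "HBG1", "HBG2",
    "HBE1", "HBZ", "HBM", "HBA2", "HBA1", "HBQ1",
]
# The three symbols the thread identifies as not being hemoglobin genes.
non_hemoglobin = {"HBEGF", "HBS1L", "HBP1"}

patterns = {
    "original": r"^HB[^(P)]",
    "lookahead": r"^HB(?!EGF|S1L|P1).+",
    "char_class": r"^HB[^(P|S)]($|[^G])",
}


def false_positives(pattern):
    """Non-hemoglobin symbols that the pattern still matches."""
    return [g for g in genes if g in non_hemoglobin and re.search(pattern, g)]


def missed(pattern):
    """Hemoglobin symbols that the pattern fails to match."""
    return [g for g in genes if g not in non_hemoglobin and not re.search(pattern, g)]


for name, pattern in patterns.items():
    print(name, false_positives(pattern), missed(pattern))
```

On this list the original pattern lets HBEGF and HBS1L through, while both suggested patterns match exactly the ten remaining symbols. This says nothing about symbols outside the list, which is the concern raised earlier in the conversation.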

A (maybe more robust?) option could be to check explicitly against a list of hemoglobin genes, as suggested by Konrad Rudolph on Stack Overflow.

Zethson · 2023-08-15T12:13:06Z

I guess one could look at the Ensembl gene symbols to see how this regex would affect them. A list of genes is also possible, but then we'd need the list ^_^

grst · 2023-08-29T14:38:27Z

I agree with @klmr that an explicit list is preferable to a regex. I'm not sure what a "trusted source" of hemoglobin genes would be, but results 1-10 from this GeneCards search are probably a good start. At the very least, that list doesn't include anything unexpected.
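A minimal sketch of the explicit-list approach, assuming the ten human symbols that were identified as hemoglobin genes earlier in this conversation. The list is taken from the thread, not from a curated source, and the names `HEMOGLOBIN_GENES` and `is_hemoglobin` are hypothetical.

```python
# Ten human symbols named as hemoglobin genes earlier in this thread.
# Assumption: this list is illustrative and should be checked against
# a trusted annotation before use.
HEMOGLOBIN_GENES = frozenset(
    {"HBB", "HBD", "HBG1", "HBG2", "HBE1", "HBZ", "HBM", "HBA2", "HBA1", "HBQ1"}
)


def is_hemoglobin(symbols):
    """Return one boolean per symbol: True if it is in the explicit list."""
    return [symbol in HEMOGLOBIN_GENES for symbol in symbols]


example = ["HBB", "HBEGF", "HBS1L", "HBP1", "HBA1", "ACTB"]
flags = is_hemoglobin(example)
print(dict(zip(example, flags)))
```

With an AnnData object, the same membership test could presumably be written as `adata.var_names.isin(list(HEMOGLOBIN_GENES))`, since `var_names` is a pandas index; where the notebook stores the resulting flag is not shown in this thread. The trade-off against a regex is that new or renamed symbols are never picked up automatically, but nothing unexpected is matched either.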

Zethson · 2023-08-29T14:46:05Z

Thank you very much @grst for the link! We'll make the changes accordingly using the list.

adc0032 · 2024-11-14T20:54:05Z

I found this helpful, but would like to note for anyone else looking into this that the pseudogenes in the mouse genome are denoted with a `-p` (e.g., Hba-ps4). I went with `^Hb[abdegmqz]-(?!p)|Hb[abdegmqz][0-9][a-z]` to match this list.

I am planning to use `^HB[ABDEGMQZ]\d*(?!\w)` to match human genes. The hemoglobin symbols seem to use a fixed set of letters corresponding to Greek letters, optionally followed by a number.
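The two patterns from this comment can be tried out with a small sketch. The human symbols are the 13 quoted earlier in the thread. The mouse symbols are an illustrative selection chosen for this example: apart from Hba-ps4 none of them is named in this conversation, so they should be checked against the actual annotation.

```python
import re

human_pattern = re.compile(r"^HB[ABDEGMQZ]\d*(?!\w)")
# Note: "|" has the lowest precedence, so only the first alternative is
# anchored with "^"; the second one may match anywhere in the symbol.
mouse_pattern = re.compile(r"^Hb[abdegmqz]-(?!p)|Hb[abdegmqz][0-9][a-z]")

# The 13 human symbols starting with "HB" quoted earlier in the thread.
human_symbols = [
    "HBEGF", "HBS1L", "HBP1", "HBB", "HBD", "HBG1", "HBG2",
    "HBE1", "HBZ", "HBM", "HBA2", "HBA1", "HBQ1",
]
# Assumption: illustrative mouse symbols; only Hba-ps4 comes from the thread.
mouse_symbols = ["Hba-a1", "Hbb-bs", "Hbq1a", "Hba-ps4", "Hbegf", "Hbs1l", "Hbp1"]

human_hits = [s for s in human_symbols if human_pattern.search(s)]
mouse_hits = [s for s in mouse_symbols if mouse_pattern.search(s)]

print(human_hits)
print(mouse_hits)
```

On these inputs the human pattern keeps the ten hemoglobin symbols and drops HBEGF, HBS1L and HBP1, and the mouse pattern drops the pseudogene Hba-ps4 along with Hbegf, Hbs1l and Hbp1.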

Changed regex for calculation of percent hemoglobulin genes

d4d7a63

KriBaLin changed the title ~~Changed regex for calculation of percent hemoglobulin genes~~ Changed regex for calculation of percent hemoglobin genes Aug 15, 2023

Zethson requested a review from AnnaChristina August 15, 2023 09:57
